Joining Extractions of Regular Expressions

Authors

  • Dominik D. Freydenberger
  • Benny Kimelfeld
  • Liat Peterfreund
Abstract

Regular expressions with capture variables, also known as “regex formulas,” extract relations of spans (interval positions) from text. These relations can be further manipulated via Relational Algebra as studied in the context of document spanners, Fagin et al.’s formal framework for information extraction. We investigate the complexity of querying text by Conjunctive Queries (CQs) and Unions of CQs (UCQs) on top of regex formulas. We show that the lower bounds (NP-completeness and W[1]-hardness) from the relational world also hold in our setting; in particular, hardness hits already single-character text! Yet, the upper bounds from the relational world do not carry over. Unlike the relational world, acyclic CQs, and even gamma-acyclic CQs, are hard to compute. The source of hardness is that it may be intractable to instantiate the relation defined by a regex formula, simply because it has an exponential number of tuples. Yet, we are able to establish general upper bounds. In particular, UCQs can be evaluated with polynomial delay, provided that every CQ has a bounded number of atoms (while unions and projection can be arbitrary). Furthermore, UCQ evaluation is solvable with FPT (Fixed-Parameter Tractable) delay when the parameter is the size of the UCQ.
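
As a rough illustration of what "extracting a relation of spans" means, the following minimal Python sketch enumerates every span of a text whose content matches a pattern. It assumes a simplified one-variable formula of the shape Σ* x{γ} Σ*, uses 0-based half-open spans, and relies on Python's `re` syntax rather than the paper's formal regex formulas. The names `extract_spans` and `cross_product_size` are hypothetical and are not taken from the paper. The second function only counts tuples in a cross product of k such one-variable relations, to hint at why materialising relations can blow up even on a single-character text; it is not the authors' hardness construction.

```python
import re


def extract_spans(pattern, text):
    """Return all spans (i, j), 0-based and half-open, such that
    text[i:j] is matched in full by `pattern`.

    This models a one-variable formula of the form  Sigma* x{pattern} Sigma*.
    """
    compiled = re.compile(pattern)
    spans = []
    for i in range(len(text) + 1):
        for j in range(i, len(text) + 1):
            if compiled.fullmatch(text[i:j]) is not None:
                spans.append((i, j))
    return spans


def cross_product_size(pattern, text, k):
    """Number of k-tuples in the cross product of k copies of the
    one-variable span relation: m ** k, where m is the number of spans."""
    return len(extract_spans(pattern, text)) ** k


if __name__ == "__main__":
    # Spans of "aab" whose content is a non-empty run of 'a'.
    print(extract_spans("a+", "aab"))
    # On the single-character text "a", the pattern "a?" matches three spans,
    # so k independent variables already yield 3 ** k tuples.
    print(cross_product_size("a?", "a", 10))
```

The brute-force enumeration checks all O(n²) spans and is meant only to make the definitions concrete; the polynomial-delay and FPT-delay results mentioned in the abstract concern enumerating query answers without first materialising such relations.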

Similar resources

Discrete Time Analysis of Multi-Server Queueing System with Multiple Working Vacations and Reneging of Customers

This paper analyzes a discrete-time $Geo/Geo/c$ queueing system with multiple working vacations and reneging in which customers arrive according to a geometric process. As soon as the system becomes empty, the servers go on a working vacation all together. The service times during the regular busy period and the working vacation period, as well as the vacation times, are assumed to be geometrically distributed. Customer...

Transformation Between Regular Expressions and ω-Automata

We propose a new definition of regular expressions for describing languages of ω-words, called ∞-regular expressions. These expressions are obtained by adding to the standard regular expressions on finite words an operator ∞ that acts similarly to the Kleene star but can be iterated finitely or infinitely often (as opposed to the ω-operator from standard ω-regular expressions, which has to be iterat...

Derivatives for Enhanced Regular Expressions

Regular languages are closed under a wealth of formal language operators. Incorporating such operators in regular expressions leads to concise language specifications, but the transformation of such enhanced regular expressions to finite automata becomes more involved. We present an approach that enables the direct construction of finite automata from regular expressions enhanced with further o...

Obtaining shorter regular expressions from finite-state automata

We consider the use of state elimination to construct shorter regular expressions from finite-state automata (FAs). Although state elimination is an intuitive method for computing regular expressions from FAs, the resulting regular expressions are often very long and complicated. We examine the minimization of FAs to obtain shorter expressions first. Then, we introduce vertical chopping based o...

Shorter Regular Expressions from Finite-State Automata

We consider the use of state elimination to construct shorter regular expressions from finite-state automata. Although state elimination is an intuitive method for computing regular expressions from finite-state automata, the resulting regular expressions are often very long and complicated. We examine the minimization of finite-state automata to obtain shorter expressions first. Then, we introd...

Journal:
  • CoRR

Volume abs/1703.10350

Publication date 2017